Learning to Find Transliteration on the Web

Authors

  • Chien-Cheng Wu
  • Jason S. Chang
Abstract

This prototype demonstrates a novel method for learning to find transliterations of proper nouns on the Web, based on query expansion aimed at maximizing the probability of retrieving transliterations from existing search engines. Since the method involves learning the morphological relationships between names and their transliterations, we refer to this IR-based approach as morphological query expansion for machine transliteration. The approach is general in scope and can be applied to both translation and transliteration, but we focus on transliteration in this paper. Many texts containing proper names (e.g., “The cities of Mesopotamia prospered under Parthian and Sassanian rule.”) are submitted to machine translation services on the Web every day, and there are also services on the Web that specifically target the transliteration of proper names, including CHINET (Kwok et al. 2005) and Livetrans (Lu, Chien, and Lee 2004). Machine translation systems on the Web such as Yahoo Translate (babelfish.yahoo.com) and Google Translate (translate.google.com/translate_t.g) typically use a bilingual dictionary that is either manually compiled or learned from a parallel corpus. However, such dictionaries often have insufficient coverage of proper names and technical terms, leading to poor translations caused by the out-of-vocabulary (OOV) problem. The OOV problem in machine translation and cross-language information retrieval can be handled more effectively by learning to find transliterations on the Web. Consider Sentence 1, which contains three place names.
1. The cities of Mesopotamia prospered under Parthian and Sassanian rule.
2. 城市繁榮下parthian 達米亞、sassanian統
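The abstract does not spell out how the morphological relationships are learned, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy list of name and transliteration pairs (SEED_PAIRS, illustrative data only), counts which target character tends to co-occur with a two-letter prefix or suffix of the source name, and appends the most frequent character to the query. The function names learn_affix_table and expand_query are hypothetical.

```python
from collections import Counter, defaultdict

# Toy seed pairs, for illustration only (not the authors' training data).
SEED_PAIRS = [
    ("Armenia", "亞美尼亞"),
    ("Albania", "阿爾巴尼亞"),
    ("Romania", "羅馬尼亞"),
    ("Macedonia", "馬其頓"),
    ("Malaysia", "馬來西亞"),
]


def learn_affix_table(pairs, n=2):
    """Count co-occurrences of name affixes with transliteration characters.

    The n-letter prefix is paired with the first target character and the
    n-letter suffix with the last one. This is a crude stand-in for a real
    alignment model.
    """
    table = defaultdict(Counter)
    for name, translit in pairs:
        lowered = name.lower()
        table[("prefix", lowered[:n])][translit[0]] += 1
        table[("suffix", lowered[-n:])][translit[-1]] += 1
    return table


def expand_query(name, table, n=2, top_k=1):
    """Return the original name plus the most likely target characters."""
    lowered = name.lower()
    terms = [name]
    for key in (("prefix", lowered[:n]), ("suffix", lowered[-n:])):
        # Unseen affixes contribute nothing to the expanded query.
        for char, _count in table.get(key, Counter()).most_common(top_k):
            if char not in terms:
                terms.append(char)
    return terms


if __name__ == "__main__":
    table = learn_affix_table(SEED_PAIRS)
    # The expanded terms would be sent to a search engine together.
    print(" ".join(expand_query("Mesopotamia", table)))
```

With this toy data, the suffix "ia" is most often associated with the character 亞, so the query for "Mesopotamia" is expanded with that character to bias retrieval toward pages that may contain a transliteration. A realistic system would need a far larger training set and a proper alignment between source and target segments.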


Similar resources

Hypothesis Selection in Machine Transliteration: A Web Mining Approach

We propose a new method of selecting hypotheses for machine transliteration. We generate a set of Chinese, Japanese, and Korean transliteration hypotheses for a given English word. We then use the set of transliteration hypotheses as a guide to finding relevant Web pages and mining contextual information for the transliteration hypotheses from the Web page. Finally, we use the mined information...


Learning Transliteration Lexicons from the Web

This paper presents an adaptive learning framework for Phonetic Similarity Modeling (PSM) that supports the automatic construction of transliteration lexicons. The learning algorithm starts with minimum prior knowledge about machine transliteration, and acquires knowledge iteratively from the Web. We study the active learning and the unsupervised learning strategies that minimize human supervis...


Transliteration Using a Network of Phoneme Chunks

In this paper, we present methods of transliteration and back-transliteration. In Korean technical documents and web documents, many English words and Japanese words are transliterated into Korean words. These transliterated words are usually technical terms and proper nouns, so it is hard to find them in a dictionary. Therefore an automatic transliteration system is needed. Previous transliter...


Improving Translation of Unknown Proper Names Using a Hybrid Web-based Translation Extraction Method

Recently, we have proposed several effective Web-based term translation extraction methods exploring Web resources to deal with translation of Web query terms. However, many unknown proper names in Web queries are still difficult to be translated by using our previous Web-based term translation extraction methods. Therefore, in this paper we propose a new hybrid translation extraction method, w...


Machine Transliteration Using Multiple Transliteration Engines and Hypothesis Re-Ranking

This paper describes a novel method of improving machine transliteration by using multiple transliteration hypotheses and re-ranking them. We constructed seven machine-transliteration engines to produce a set of transliteration hypotheses. We then re-ranked the hypotheses to select the correct transliteration hypothesis. We propose a re-ranking method that makes use of confidence-score, languag...



Publication date: 2007